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Abstract 

Let X, Z be r and s-dimensional covariates, respectively, used to model the response 
variable Y as Y = m(X, Z) + cx(X, Z)e. We develop an ANOVA-type test for the null 
hypothesis that Z has no influence on the regression function, based on residuals obtained 

CO 

■^j- ■ from local polynomial fitting of the null model. Using p- values from this test, a group 
00 ' 

• variable selection method based on multiple testing ideas is proposed. Simulations studies 

in 

suggest that the proposed test procedure outperforms the generalized likelihood ratio test 

when the alternative is non-additive or there is heteroscedasticity. Additional simulation 

^ | studies, with data generated from linear, non-linear and logistic regression, reveal that the 
?H ■ 

proposed group variable selection procedure performs competitively against Group Lasso, 
and outperforms it in selecting groups having nonlinear effects. The proposed group variable 
selection procedure is illustrated on a real data set. 
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Dimension reduction; Backward elimination. 
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1 Introduction 



Advances in data collection technologies and data storage devices have enabled the col- 
lection of data sets involving a large number of observations on many variables in several 
disciplines. When the objective of data collection is that of building a predictive model for a 
response variable, the challenges presented by massive data sets have opened new frontiers 
for statistical research. While the inclusion of a large number of predictors reduces modeling 
bias, the practice of including insignificant variables is likely to result in complicated models 
with less predictive power and reduced ability to discern and interpret the influence of the 
predictors. The underlying principles of modern model building are parsimony and sparse- 
ness. Parsimony requires simple models based on few predictors. Sparseness is a relatively 
new concept which evolved from the realization that in most scientific contexts prediction 
can be based on only a few variables. Variable selection uses the assumption of sparseness, 
enabling parsimonious model building. Thus, variable (also called feature) selection plays a 
central role in current scientific research as a fundamental component of model building. 

Due to readily available software, variable selection is often performed by modeling the 
expected response at covariate value x as m(x) = x/3. Classical approaches to variable selec- 
tion, such as stepwise selection or elimination procedures, and best subset variable selection, 
can be computationally intensive or ignore stochastic errors. A new class of methodologies 
addresses variable selection through minimization of a constrained or penalized objective 
function, such as Tibshirani's (1996) LASSO, Fan and Li's (2001) SCAD, Efron, Hastie, 
Johnstone and Tibshirani's (2004) least angle regression, Zou's (2006) adaptive LASSO, and 
Candes and Tao's (2007) Dantzig selector. A different approach exploits the conceptual con- 
nection between model testing and variable selection: dropping variable j from the model is 
equivalent to not rejecting the null hypothesis H 3 : f3j = 0. Abramovich, Benjamini, Donoho 
and Johnstone (2006) bridged the methodological divide by showing that the application 
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of the false discovery rate (FDR) controlling procedure of Benjamini and Hochberg (1995) 
on p-values resulting from testing each Hq can be translated into minimizing a model se- 
lection criterion similar to that used in Tibshirani and Knight (1999), Birge and Massart 
(2001) and Foster and Stine (2004). These criteria are more flexible than that of Donoho 
and Johnstone (1994), which uses a penalty parameter depending only on the dimensionality 
of the covariate, as well as AIC and Mallow's C p , which use a constant penalty parameter. 
Working with orthogonal designs, Abramovich et al. (2006) showed that their method is 
asymptotically minimax for i r loss, < r < 2, simultaneously throughout a range of spar- 
sity classes, provided the level q for the FDR is set to q < 0.5. Generalizations of this 
methodology to non-orthogonal designs differ mainly in the generation of the p-values for 
testing Hi : (3j = 0, and the FDR method employed. Bunea, Wegkamp and Auguste (2006) 
use p- values generated from the standardized regression coefficients resulting from fitting the 
full model and employ Benjamini and Yekuteli's (2001) method for controlling FDR under 
dependency, while Benjamini and Gavrilov (2009) use p- values from a forward selection pro- 
cedure where the ith stage p-to-enter is the ith stage constant in the multiple-stage FDR 
procedure in Benjamini, Krieger and Yekutieli (2006). 

Model checking and variable selection procedures based on the assumption of a linear 
model may fail to discern the relevance of covariates whose effect on m(x) is nonlinear. Be- 
cause of this, procedures for both model checking and variable selection have been developed 
under more general/flexible models. See, for example Li and Liang (2008), Wang and Xia 
(2008), Huang, Horowitz and Wei (2010), Storlie, Bondell, Reich and Zhang (2011), and ref- 
erences therein. However, the methodological approaches in this literature have been distinct 
from those of model checking. Working under a fully nonparametric regression model, Zam- 
bom and Akritas (2012) developed a competitive variable selection procedure by exploiting 
the aforementioned conceptual connection between model checking and variable selection. 
Their approach consists of backward elimination using the Benjamini and Yekuteli (2001) 
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method applied on the p- values resulting from testing the significance of each covariate. The 
test procedure they developed is based on the residuals obtained by fitting all covariates ex- 
cept the one whose significance is being tested. These residuals serve as the response variable 
in a one-way high-dimensional ANOVA design whose factor levels are the values of the co- 
variate being tested. By augmenting these factor levels, and using smoothness assumptions, 
they developed an asymptotic theory for an ANOVA-type test statistic. 

In many applications, covariates come in groups. For example, microarray experiments 
generate very large datasets with expression levels for thousands of genes but, typically, small 
sample size. Studies show that genes can act together as groups, and the scientific task is 
that of selecting the groups that are strongly associated with an outcome variable of interest. 
This type of problem can be addressed by first forming groups of genes through a clustering 
method and then selecting the important groups through a group selection procedure. One 
of the most common group selection procedures is the Group Lasso (Yuan and Lin, 2006), 
and the Adaptive Group Lasso (Wang and Xia, 2008). See also Park, Hastie and Tibshirani 
(2007) who, using averages of the genes within each group, perform a selection based on a 
procedure combining hierarchical clustering and Lasso. 

The first part of is paper develops an extension of the ANOVA test procedure of Zam- 
bom and Akritas (2012) to testing the significance of a group of variables under a fully 
nonparametric model which also allows for heteroscedasticity. The second part of the paper 
introduces a backward elimination procedure for group variable selection using the Benjamini 
and Yekuteli (2001) method applied on the p-values resulting from testing the significance 
of each group. 

This paper is organized as follows. Section [2] describes the proposed methodology for 
testing the significance of a group of variables, derives the asymptotic null distribution of 
the test statistic, and presents results of simulation studies comparing its performance to 
that of the generalized likelihood ratio test of Fan and Jiang (2005). Section [3] describes 
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the test-based group variable selection procedure, and presents results of simulation studies 
comparing its performance to that of Group Lasso. The analysis of a real data set involving 
gene expression levels of healthy and cancerous colon tissues is presented in Section HI 



2 Nonparametric Model Checking 

2.1 The Hypothesis and the Test Statistic 

Assume we have n observations, (Yj,Ui), i = 1, . . . ,n, of the response variable Y and 
covariates U = (X, Z), where X and Z have dimensions r and s respectively (r + s = d). 
Let m(x, z) = E{Y\K. = x, Z = z) denote the regression function. The heterocedastic 
nonparametric regression model is 



where e has zero mean and constant variance and is independent from X and Z. The goal 
is to test the null hypothesis that Z does not contribute to the regression function, i.e. 



The idea for testing this hypothesis is to treat the covariate values Zj, i = 1, . . . , n, as the 
levels of high- dimensional one-way ANOVA design, with the null hypothesis residual £j = 
Yi — m^Xj) being the observation from factor level Zj, and construct an ANOVA based test 
statistic. Because the asymptotic theory for high-dimensional ANOVA requires more than 
one observation per factor level (Akritas and Papadatos, 2004), we will employ smoothness 
conditions, which will be stated below, and augment each factor level by including residuals 
from nearby covariate values. With a univariate covariate, such factor level augmentation 
was carried out in Wang, Akritas and Van Keilegom (2008) and Zambom and Akritas (2012) 
by ordering the covariate values and including in each factor level the residuals corresponding 



Y = m(X,Z) + a(X,Z)e, 



(1) 



H : m(x, z) = m^x). 



(2) 
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to neighboring covariate values. With a multivariate covariate, the challenge is to order the 
factor levels, and hence the residuals, in a meaningful way resulting in a test statistic with 
good power properties. To do so, we propose to replace each Zj by a nonlinear version of Bair, 
Hastie, Paul and Tibshirani's (2004) first supervised principal component (PC), Pg^ = ZfCg. 
The subscript 9 will be explained below when the supervised PC is introduced. 

Having a univariate surrogate of Z, we augment each cell Pg^ = ZjCg by including 
additional p — 1, for p odd, residuals ^ which correspond to the p — 1 nearest neighbors Pg^ 
of Pgj. To be specific, we consider the (£i,Pe,i), i — 1, . . . ,n, arranged so that Pg^ < Pg,i 2 
whenever z'i < z 2 , and for each Pg^, (p — l)/2 < i < n — (p—l)/2, define the nearest neighbor 
window Wi as 

Wi{Cg) = jj : \F P (Pe d ) - F P (Pg,)\ < ^} , (3) 

where F P is the empirical distribution function of Pg : \, . . . , Pg tTl . WiiCg) defines the aug- 
mented cell corresponding to Pg^. Note that the augmented cells are defined as sets of 
indices rather than as sets of £j values. The vector of (n — p+ l)p constructed "observations" 
in the augmented one-way ANOVA design is 

^C fl = £ W {p - 1)/2+1 (Cg), . . G W n _ {p „ 1)/2 (Cg)) T . (4) 

Let MST and MSE denote the balanced one-way ANOVA mean squares due to treatment 
and error, respectively, computed on the data £ Cfl . The proposed test statistic is based on 

MST - MSE. (5) 

In this paper the residuals ^ — Yi — m^Xj), i = 1, . . . , n, will be formed using the local 
polynomial of order q regression estimator defined by 

n 

mi(X0 = ef (3^ i W x< Xx i )~ 1 X^Wx i Y = ^ «5(X<, X,)^, i = 1, . . . ,n, (6) 
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where W x = diag{K Hn {*x ~ x), . . . , K Hn (X n - x)}, with K Hn {*) = \H n \~ l l 2 K{Hn l/2 ^ 



for K(-) a bounded, non- negative r-variate kernel function of bounded variation and with 

1/2 

bounded support and H n is a symmetric positive definite r x r bandwidth matrix, and 

f l (Xi-xf vech r {(Xi -x)(Xi -x) T } ...^ 
— : : : . . . , 

v l (X n -x) T vech T {(X n -x)(X n -x) T } ... j 

with vech denoting the half-vectorization operator, is the n x j r ^ q design matrix, where 

i j j 

j=0 fei=0 k r =0 
k 1 +...+k r =j 

We finish this section with a description of the construction of the first non-linearly su- 
pervised principal component Pg = Z T Cg. Let pj,j — 1, . . . , s, denote the p-values obtained 
by applying the test of Zambom and Akritas (2012) for testing the hypothesis H 3 Q which 
specifies that Zj, the jth coordinate of Z, has no effect on the regression function of the 
model with response variable Y and covariate vector (X, Zj). For a threshold parameter 
9, define the index set J = {j : Pj < 6} and let Zj be the vector formed from the J 
coordinates of Z. Then, Pg = Z T Cg is the first principal component of Zj. Note that some 
entries of Cg are equal to 0, corresponding to the coordinates of Z with pj greater or equal 
to 9. It is important to keep in mind that the observable vector of first nonlinear principal 
components, Pg = (Pg,i, ■ ■ ■ , Pe,n), depends on the estimated residuals £j, % = 1, . . . , n, to the 
extend that J7", and hence Cg, depend on them. 



2.2 Asymptotic null distribution 

Theorem 2.1. Assume that the marginal densities /x ; fz o/X, Z, respectively, are bounded 
away from zero, the q + 1 derivatives of m\ (x) are uniformly continuous and bounded, that 
cr 2 (.,z) := 2£(£ 2 |Z T C) is Lips chitz continuous, sup x z a 2 (x, z) < oo ; and E{ef) < oo. Assume 
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that the eigenvalues, A^, i — 1, . . . ,r ; 0/ i/ie bandwidth matrix HJ converge to zero at the 
same rate and satisfy 

nAf (9+1) ^0 and * -)• oo, f = l,...,r. (7) 
(logn) 2 

Then, under H in (TJ)), the asymptotic distribution of the test statistic in (TJp is given by 

n l ' 2 {MST - MSE) A N(0, 2p{2p ~ ] ] r 2 ) , 

3(p- 1) 

where t = J [Ja 2 (x,z)f^ lZ T C=z T C (x.)dyi] 2 f Z T C (z T C)d(z T C). 

An estimate of r 2 can be obtained by modifying Rice's (1984) estimator as follows 
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' 4(n - 3) . =2 

Asymptotic theory under local additives and under general local alternatives is derived 
in Zambom (2012). As these limiting results show, the asymptotic mean of the test statistic 
MST — MSE is positive under alternatives. Thus, the test procedure rejects the null 
hypothesis for "large" values of the test statistic. 



2.3 Simulations: Model Checking Procedures 

We compare the proposed ANOVA-type hypothesis test for groups with the generalized 
likelihood ratio test of Fan and Jiang (2005). The data is generated under three situations: 
a homoscedastic additive model, a homoscedastic non-additive model, and a heterocedastic 
non-additive model. All covariates, in all models, are independent standard normal. The 
homoscedastic additive model is 

Y = X 1 + 9(Z 1 + Z 2 + Z 3 ) + e, where e~JV(0,l), (9) 

the homoscedastic non-additive model is 

Y = X^(l + 9(Z 1 + Z 2 ))+X e 2 {Zl+Z2) + e, where e~iV(0,.l 2 ), (10) 
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and the heterocedastic non-additive model is 

Y = X 1 + 0sm(Z 1 Z 2 ) + Z x Z 2 e, where e~iV(0,.5 2 ). (11) 

In each situation we simulate 2000 data sets ot size n = 200. All simulations were performed 
in R. 

In order to evaluate the effect of the threshold parameter 9 we applied our test procedure 
with 9 = 0.05 and 9 = 0.2. Moreover, in each case we considered two rules to form the set of 
covariates from which the first supervised principal component is obtained. Rule 1 consists 
of using only the covariates with p- value less than 9, and in Rule 2 we consider the set of 
covariates chosen from Rule 1 and add to the set the covariate with the smallest p-value 
among the remainder covariates. In each case, if the number of selected covariates is less 
than two the set is formed from the two with the smallest p-value. Thus the simulations 
consider four versions of our test statistic: a) Rule 1 with 9 = 0.05, b) Rule 1 with 9 = 0.2, 
c) Rule 2 with 9 = 0.05, d) Rule 2 with 9 = 0.2. All four versions of our test statistic use 
windows of p = 11. 

Tables [fl [2J and El show the simulation results for models fl9|).f lT0|) . and f jTTj) . respec- 
tively. It is seen that the proposed test procedure is robust to the choice of the threshold 
parameter, and to the rules for selecting the set of covariates from which the first supervised 
principal component is obtained. The Generalized Likelihood Ratio test, which is designed 
for homoscedastic additive models, achieves better power under model ([9]), but is extremely 
liberal under heteroscedasticity and its power for the non-additive alternatives of model ( TTTT> 
is mainly less than its level; see Table |3j Table [2] suggests that the GRLT has low power 
against non-additive alternatives even in the homoscedastic case. 
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Table 1: Rejection rates for the homocedastic additive model 
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Method 







.2 


.4 


.6 


.8 


ANOVA-type- 


•a 


.066 


.404 


.691 


.706 


.751 


ANOVA-type- 


•b 


.060 


.378 


.613 


.689 


.733 


ANOVA-type- 


■c 


.066 


.396 


.600 


.692 


.749 


ANOVA-type- 


■d 


.057 


.375 


.618 


.685 


.718 


GRLT 




.048 


.883 


1 


1 


1 



Table 2: Rejection rates for the homocedastic non-additive model 

6 = 



Method 







.02 


.04 


.06 


.08 


ANOVA-type- 


•a 


.051 


.202 


.522 


.693 


.724 


ANOVA-type- 


•b 


.047 


.192 


.560 


.710 


.739 


ANOVA-type- 


■c 


.050 


.193 


.520 


.679 


.711 


ANOVA-type- 


■d 


.047 


.161 


.510 


.676 


.733 


GRLT 




.052 


.059 


.117 


.235 


.379 



3 Nonparametric Group Variable Selection 

In this section we will present a test-based group variable selection. For this purpose we 
will make a slight change in notation by letting X denote the entire vector of covariates. 
Thus, we consider the nonparametric regression model 

Y t = m(Xi) + a(Ki)si, i = l,...,n, (12) 

where e, is the independent error with zero mean and constant variance. Suppose that the 
covariates are classified in d groups identified by the indices Je = {j : Xj belongs to group £}, 
I = 1, . . . , d, and let si denote the size of group Jg. Moreover, we assume sparseness in the 
sense that only the variables in a subset I = { J±, . . . , J<f } C { J\, . . . , Jd} of the groups 
influence the regression function. Finally, we will assume the dimension reduction model of 
Li (1991), i.e. 

m(x) = (yf(Bx), where B is a K x Sj) matrix. (13) 
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Table 3: Rejection rates for the heterocedastic non-additive model 
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Method 







.3 


.6 


1 


2 


ANOVA-type 


-a 


.035 


.168 


.503 


.654 


.789 


ANOVA-type 


-b 


.040 


.172 


.529 


.663 


.767 


ANOVA-type 


-c 


.037 


.161 


.501 


.651 


.757 


ANOVA-type 


-d 


.036 


.190 


.520 


.657 


.742 


GRLT 




.584 


.585 


.535 


.439 


.297 



Define the hypothesis 

H o '■ m ( x ) = m i( x (-J f )) ; £ = l,...,d, 

where X-(-j t ) is the set of all covariates except those whose index are in Jg. Under the 
dimension reduction model (fl3]) . this hypothesis can be written equivalently as 

^:#x)= 5 (B ( _ J() x ( . J() ), £=l,...,d, (14) 

where "Bf-jA is the K x (d — s^) matrix obtained by omitting the columns of B with indices 
in J(. Let B denote the Sliced Inverse Regression (SIR) estimator of B, and B(_j/) be the 
corresponding submatrix. With this notation, let 

z t = yJn(MST e - MSE e ) / 

be the test statistic for testing the hypothesis ( IT4|) with B(_^)X(_j^) playing the role of X 
in Theorem 12. 1[ and Bo^X^) playing the role of Z, where X^) is the set of all covariates 
whose index are in Je and ~B(j t ) is the corresponding submatrix of B. 

In this context we will describe the following group variable selection procedure using 
backward elimination based on the Benjamini and Yekuteli (2001) method for controlling 
the false discovery rate (FDR): 

1. Compute the p-value for Hq as Tig = 1 — $(^), £ — 1, . . . , d. 



'2p(2p 



3(p-l) 



f 4 



10 



2. Compute 

k = max < i : irm < — -5 > (15) 

for a choice of level a, where 7T(i), . . . , 7T(d) are the ordered p- values. If = d stop and 
retain all groups. If k < d 

(a) update x by eliminating the covariates of the group corresponding to ir^, 

(b) update d to d — 1, 

(c) update B by eliminating the columns corresponding to the deleted variables, 

(d) update the test statistic zi, £ = 1, . . . , d. 

3. Repeat steps [U and [2j with the updated zi, £ = 1, . . . , d. 

Remark Another approach for constructing a group variable selection procedure is to use 
a single application of the Benjamini and Yekuteli (2001) method for controlling the false 
discovery rate (FDR). This is similar to one of the two procedures proposed in Bunea et 
al. (2006). However, this did not perform well in simulations and is not recommended. A 
backward elimination approach was used in Li, Cook and Nachtsheim (2005), but without 
incorporating multiple testing ideas. 



3.1 Simulations: Variable selection procedure 

In this section we compare the variable selection based on the ANOVA-type test to the 
Group Lasso proposed by Yuan and Lin (2006). We study the behavior of the selection for 
two different scenarios, one with a continuous response and another with a binary response. 
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For the continuous response scenario the data is generated according to the models 

Model 1 : Y = Xf + X 3 2 + X 3 + (1/3)X 6 3 - X 2 + (2/3)X 6 + e 
Model 2 : V = sm(X 3 3 + X 3 2 + X 3 ) + (1/3)X 3 - X 6 2 + (2/3)X 6 + e 
Model3:r = 10sm(X 3 3 + X 2 + X3) + 5sm((l/3)X 3 -X 2 + (2/3)X 6 ) + e 

where X = (Z t + W)/y/2, Z u % = 1, . . . , 16 and W iid iV(0, 1), and e ~ iV(0, 2 2 ). Thus, for 
Models 1, 2 and 3 there are 16 groups of three covariates each, represented by the polynomial 
terms. The only groups that are significant are groups 3 and 6. We run 1000 simulations of 
data sets of size n = 100. Table |4] shows the mean number of correct and incorrect groups 
selected by the ANOVA-type variable selection and Group Lasso using the C p criterion. It is 
seen that Group Lasso tends to select more groups that are not significant to the regression, 
while both methods perform competitively in selecting the significant groups. 

Table 4: Results for the ANOVA-type and Group Lasso 



Model 


Method 


Corr. Selected 


Incorr. Selected 


Model 1 


ANOVA-type 


1.80 


.55 




Group LASSO 


2 


4.7 


Model 2 


ANOVA-type 


1.15 


.81 




Group LASSO 


1.59 


4.21 


Model 3 


ANOVA-type 


1.84 


0.64 




Group LASSO 


1.80 


6.75 



For the second scenario, we consider the following three logistic regression models. 

Models 1 and 2 : Pj (X) = =- — , j = 1, 2, 

where X = (Xi, . . . , A 15 ) are iid U(0, 1), grouped sequentially in 5 groups of 3 covariates 
each, and 

(3, = (1,-2.2, 2, 0,0, 0,0, 0,0, 0,1, 2, 0,0,0, Of 
(3 2 = (1,-2.2,3,0,0,0,0,0,0,0,1,3,0,0,0,0) T . 

1 



Model 3 : p 3 (X) 



1 + exp (-18 sin(7rV 2 ) - 18 sin(7rX 8 )) ' 
12 





Table 5: Results 


for logistic reg 


ression 


Model 


Method 


Gorr Selected 


Tncorr Selected 


Model 1 


ANOVA-type 


.340 


.261 




Group LASSO 


.197 


.032 


Model 2 


ANOVA-type 


.287 


.312 




Group LASSO 


.100 


.021 


Model 3 


ANOVA-type 


1.223 


0.080 




Group LASSO 


0.040 


0.039 



where X = (Xi, . . . , X 12 ), with X\, . . . , X n iid U (0, 3) and X 12 ~ iV( — 3, 1) independent of 
the others, are grouped sequentially in 4 groups of 3 covariates each. 

The results in Table [5] are based on 1000 simulation runs using n = 100 for Models 1 
and 2, and n = 200 for Model 3. It is seen that for Models 1 and 2 the number of correctly 
selected covariates by either procedure is low. This is probably due to the smaller sample size 
and the larger number of covariates. For Model 3, the Group Lasso fails to select covariates, 
while the ANOVA-type procedure seems to perform very well. In summary, the simulation 
results suggest that the ANOVA-type variable selection procedure outperforms the Group 
Lasso when the logistic regression model involves a non-linear function of the covariates, and 
has competitive performance in the other cases. 



4 Real Data Example 

The proposed procedure will be illustrated with an analysis of the colon cancer dataset 
of Alon et al. (1999). The dataset was obtained from the Affymetrix technology and shows 
expression levels of 40 tumor and 22 normal colon tissues of 6,500 human genes. A selection 
of 2,000 genes with highest minimal intensity across the samples has been made by Alon et 



al. (1999) and is publicly available at http://microarray.princeton.edu/oncology. Different 



clustering methods have been applied to this data set in several previous studies including 
Dettling and Buhlman (2002, 2004), and Ma, Song and Huang (2007). 
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To illustrate the proposed ANOVA-type group variable selection procedure we first apply 
a clustering method to form the groups. We chose the supervised clustering procedure Wilma 
proposed by Dettling and Buhlman (2002) which is available in the package supclust in R. 
Wilma requires as input the number of clusters to be formed, and we specified 60, 55, 50 and 
45 clusters. The next step of the proposed procedure requires dimension reduction through 
SIR. However, because the number of genes (2,000) is much larger than the sample size (62) 
it is not possible to use SIR straightforward. Therefore to estimate B, we ran SIR on the 
set of predictors composed of the first supervised principal component of each cluster. 

We also ran the Group Lasso procedure for binary responses using the package grplasso in 
R on the same clusters/groups returned by Wilma. However, a corresponding modification 
is needed for the calculation of the degrees of freedom needed for the application of the C p 
criterion; see Yuan and Lin (2006). This calculation requires the estimator (3 from fitting all 
individual covariates. Since the number of covariates is much larger than the sample size, 
we obtain an approximation to the required estimator by first obtaining the estimator ftp 
from fitting the first PC from each cluster. Since the PCs are linear combinations of the 
covariates in each cluster, having a coefficient for a cluster's PC translates into coefficients 
for the covariates in that cluster. 



Table 6: Results for Colon data set 



No. Initial Clusters 


Procedure 


Clusters Selected 


60 


ANOVA-type 


1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 15, 16, 
18, 23, 24, 25, 28, 31, 33, 34, 39, 42, 46 




Group Lasso 


1, 2, 3, 8, 9, 16, 17, 37 


55 


ANOVA-type 


1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 15, 16, 
17, 22, 23, 24, 25, 26, 33, 38, 42, 45 




Group Lasso 


2, 3, 7, 9, 13, 18 


50 


ANOVA-type 


1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 14, 15, 16, 
18, 21, 23, 24, 25, 26, 36, 40, 45, 47, 48, 49 




Group Lasso 


2, 3, 7, 9, 27, 29 


45 


ANOVA-type 


1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 15, 16, 
18, 21, 22, 23, 24, 25, 31, 34, 37, 40, 42, 44, 45 




Group Lasso 


2, 3, 7, 9, 10, 16 
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Figure 1: Plot of Group Lasso Logit Estimate Against the First PC of Clusters 4 and 2 

Table [6] shows the groups selected by the proposed group variable selection and the group 
lasso procedure for the different specified number of clusters returned by Wilma. 

We note that there is significant overlap in the genes included in the clusters selected 
by each method across the different numbers of total clusters specified. Thus, the different 
number of total clusters specified is not critical for selecting the important genes. The 
proposed method selects more clusters, which is contrary to the simulation results. This is 
probably due to the linear link function for the logit used in grplasso. For example, when 
the logit of the fitted probability of cancerous tissue is plotted against the first PC of cluster 
4, which is selected by the proposed method but not by Group Lasso, it shows a nonlinear 
effect (left panel of Figured]), whereas the non-linearity is much less pronounced when plotted 
against the first PC of cluster 2, which is selected by the proposed method and by Group 
Lasso (right panel of Figure [TJ) . 
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A Appendix 



Proof of Theorem \2.1[ Under H in fl2]) we write 



ii = y r i -mi(X i ) + m 1 (X i )-mi(X i )=ei-(mi(X i )-mi(X i )) 
= & - A mi (Xj), 

where A mi (Xj) is defined implicitly in the above relation. Thus, £ Cfl of relation (j4]) is 
decomposed as £ Cfl = £ c — A mi c e , where £ Cf) and A mi c e are defined as in (BJ but using ^ 
and A mi (Xj), respectively, instead of Note that MST-MSE given in dSJ) can be written 
as a quadratic form £c e ^£c<3 ( see Wang, Akritas and Van Keilegom, 2008), where 

1 , 1 



np — 1 

= n(n-l)p(p-l) ® t=1 P ~ ~~ nv ~ ^ np ' 



(16) 



n(n — l)p n(p — 1) 
Id is a identity matrix of dimension d, 3d is a dxd matrix of l's and © is the Kronecker sum 
or direct sum. Thus, we can write \/n(MST - MSE) as 



(17) 



That y/n2$ > Q e AA miCg and y/nA miC AA mi c g converge in probability to uniformily 
follows from arguments similar to those used in Zambom and Akritas (2012). 

Using Corolary IA.ll to show the asymptotic normality of \/n£c g A£ c , it is enough to 



show that 



sup 
c 



P 



2p(2p-l) 2 
3(p-l) 



0. 



Let b n ~ ra 2//3 and r n ~ n/6 n ~ n 1 / 3 and write 

n 



'n c — ' wn 

i=i v i=i 

1 1 

—=Su{Cq) H — y=Sv(Ce), 
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where, with 7 f (C fl ) = J2j^j 2 ijAh 1 (h, h e Wi(C e )), 



UiiCe) = 7(i_i)(fe n+p ) + i(C ) + . . . + 7(i_i)( 6 „ + p) +6 „(C 6 
Vi{C e ) = 7(i_i)(b„ +p ) + 6 n+ i(C e ) + . . . + 7i(6„+p)(C e )- 



Note that the U^Cq) are independent, and the Vi(Gg) are independent. 
Now, letting sd = 



sup 
c 



sup 
c 



sup 
c 



2 3fr!-i) 1) r2 ' we have 



I y/n£ c A d $ c < t 
sd 

Su(C) + S V (C) ^ t 



«&(*) 



y^sd 

\ y^sd ~~ 



A/rasd ' 



$(*) 



S'vfC] 



\ vnsd 



y^sd 
SV(C 



< e 
S'v(C) 



A/nsd ' 



^/nsd 



> e 



$(t) 



< sup 
c 



p(%M< ( 

\ ynsd 



Sy(C) 
y/nsd ' 



S'vfC) 



V^sd 



< e 



+ sup P 



Sv(C) 



V^sd 



> e 



(19) 



That the second in (fT9l) term converges to zero follows from Lemma [A. 21 That the first term 
in (I19jl converges to zero follows from Lemma IA.31 provided we show that 



( ^ u ^ i _ - ) — y for any C. 



(20) 



By dUD, and because A 0, (p0|) follows from sup c Vav(s/n^ c A d ^ vC ) -»■ sd 2 . By the 

definition of £c e ^£cv ^ is eas y to see 

that E (€lA d £ c ) = for any C. To find the variance 
of y/E^ c A4 vC we first evaluate the conditional second moment E^^n^^Ad^c) 2 ^ 1 C}. 



17 



Recalling the notation cr 2 (.,zJC) = P(£ 2 |Z T C = zJC), we have 
1 

n{p - 1) 



j 

n n n 



r^E E E ^teA^a 2 |Z T C)/( Js e WJ.(CH e W la (C),s = 1,2) 



2 

sup 
c 



n n n 



sup 
c 



n(p — 


I) 2 


2 






I) 2 


2 




n(p — 


I) 2 


2 




7l(p — 


I) 2 


2 




n(p — 


I) 2 



11=1 12=1 j^=l 
n n n 



n it, n / / \ \ 

E E E °*(-> z J c ) ( ^ z J^ + °p[ J h)) ^ 1 G w * (°) n ^ (°)) 



s * p tT^tp £ ^ z , T c) E E E 7 C?> ' e ^ (o n ^ (O) + ^ 

C ^ J 3=1 i 1= ii 2 =i \ / 



s c p M^TF ^ a4( " z ' C)2(1 + 2 " + 3 " + - + {p ~ lf) + ° p 



.7=1 V 7 



= sup 
c 

3 

where the third equality follows from Lemma IA.5I using the assumption that a 2 (.,zJC) is 
Lipschitz continuous and the second last inequality results from the fact that if 1 < \j\ — j'2 1 = 
s < p — 1, then they are (p — s) 2 pairs of windows whose intersection includes ji and ji- 
Taking limits as n — > 00 it is seen that 

From relation (f2~T|) it is easily seen that sup c P[(a/t7£cA2£c) 2 |Z t C] remains bounded, and 
thus sup c Var(n 1 / 2 ^^4Cc) also converges to the same limit by the Dominated Convergence 
Theorem. □ 

Lemma A.l. If the assumptions of Theorem \2.1\ hold, then under H Q and as n — >■ oo, 

supP (n 1 / 2 \&,Mc, ~ €c.A£c.\ > e) 0, (22) 
w/iere A d = diag{B ly luz'rj/j P* = [J p - I p ]. 

Proof. By Chebyshev Inequality, we have that 

, ^ [fto^c " &Uc) 2 " / 

SUpP ^T^ c _ £^£ c > e < gup L 1 (23) 

c c e 
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Since the block diagonal elements of Ad equal those of A, it sufices to show that the off 
diagonal blocks of A are negligible. For ij ^ i 2 , every element of the block (11,12) equals 
n (n-i)p ' ^ e sri ow that the second moment on the right hand side of (1231) conditionally 
on Z goes to zero, and therefore the unconditional second moment also does. To that end, 
write 



nE 



sup 
c 



n 



(£c^c-&U c ) a |z 



(24) 



n[n 



n 



n(n 



sup^f^^ J2 &&t»bJVkeW ih (C),k=l,...,4)\z\ 

1 \ 2 n 

rrr) 8 S P ?? ^. J5 ^^i z ) j ^* gW *^ c )' A! = 1 '---' 4 ) 

«1^«2 *37^4 jl J2J3J4 = 1 



The expected value in this sum is different from zero, only if . . . , £j 4 consists of two pairs 
of equal observations, or ji = j 2 = J3 = J4. Since there are 0(n 2 p A ) terms for the former case 
to happen and 0(np 4 ) for the latter case to happen, and the magnitude of these terms is not 
affected by C, the order of (1241) is O {j;^h^i n2 P^ — °(1)> an d this completes the proof. □ 

Corollary A.l. Let Ad = diag{Bi, B n } , with Bi 



l p ], sd 



2p(2p-l) 



n(p-l)L"P '■Pi) - y 3(p-l) 

and £ c 6e defined in OJ) u>zt/i C instead of Cg . Then, under the assumptions of Theorem 
\2.1\ we have 



sup sup 
c t 



sup sup 
c t 



l"' /2 ^°< f 

set 

F|°' ,i ^<» 
sd 



*(t) 
-*(t) 



if and only if 
0. 



Proof. Write 



11 



l/2tT 



V2 



sd 



sd 



sd 
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Now, for any t 



sup 
c 



sd 



< t - m 



sup 
c 



p 



n 



sd 



< t 



n 



V2 (CT 



{fcMc - fcAdc) 



sd 



n 



V2 (fiT 



sd 



+P 



ii 



V2 ftT 



sd 



< sup max 
c 



+ sup P 
c 



p 



< t - 



n 1/2 f c A d Zc 
sd 



sd 

<t + e] -$(*) 



ii 



V2 (tT 



sd 



< e 



> e - *(t) 



P 



™ 1/2 £cAi£c 



™ 1/2 (^c - £^c) 



sd 



>e . 



(25) 



The last term in ( |25i) goes to zero by Lemma IA.1I Thus, 



sup sup 
c t 



p 



sd 



< t - $(t) 



< sup max < sup 
c t 



<t) -$(t-e) 

sd 



sup 

t 



< t) -$(t + e) 

sd 



< sup sup 
c t 



p 



- < t - 



sd 



sup|$(t) - $(t + e)| +o(l). 

t 



Letting e — ^ 0, 



lim sup sup 

n ^°° c t 



sd 



< lim sup sup 

n ->°° c t 



sd 



Using similar steps, it can be shown that 



lim sup sup 

n ^°° c t 



sd 



*(t) 



< lim sup sup 

c t 



I "' /2 ^c < t 
sd 



completing the proof. 



□ 



Lemma A. 2. Let Sy(C) &e defined as in ffW . Under the assumptions of Theorem \2.1[ 

S V (C) 



sup P I 
c V 



n sd 



> e -»-0 
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Proof. For any e > 0, since Vi(C) are independent, 

supP ( n-^ 2 \J2 V i( C )\ > e ) < sup^P (|^(C)| > er^r" 1 ) 



i=l 



t=l 



c — e 4 n 2 r n 4 



where the last inequality follows from the fact that E(V^(C)) < K(p 2 ) 2 . 



□ 



Lemma A. 3. Let Su{C) and SV(C) be defined as in ( TTgj) . Under the assumptions of The 
orem lK7[ 

\S V (C) 



sup 
c 



^(C) < g y (C) 



rase? 



nsd 



rase? 



-> 0. 



(26) 



Proof. Note that, using the Berry Essen bound (see Shorack (Probability for Statisticians)), 
and the fact that Var( Sv ^ ) — > sd 2 as shown in the proof of Theorem 12. 1[ we have 



sup sup 
c t 



< 9 sup 



TZiE\Ui{C)\ 3 

3/2 



c E[:iVar(C/ i (C))] 



o(l). (27) 



Let t* = t - ^H, then 

VnSd ' 



sup 
c 



sup 
c 



(Su(C) <t Sy(C)^ 
\ y/nsd ~ y^sd ' 

\ y^nsd 



Sv(C) 



< t\ 



Sv(C) 



^/nsd 
< e ) - P 



<e -$(t) 



\ v^risd 



p ^(C) 



\ -y/nsd 



A/nsd 

<r ) -$(t*) + $(t*)-$(t) 



< sup 
c 



p 



\ y/nsd 



Sv(C) 



A/nsd 



< e - P 



Sn(C) 



+ sup 
c 



\ v^sd 



(28) 



v^sd 



The first term in ( 1281) goes to zero by continuity of measures, since by Lemma lA.2I P 
1. The second term in ( 128|) goes to zero by (1271) . and the third term goes to zero by the 
continuity of $(.)• 



□ 
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Lemma A. 4. Let X±, . . . ,X n be iidfF], and let F n (x) be the corresponding empirical distri- 
bution function. Then, for any constant c, 



sup Xi;X . \F( Xi ) - F( Xj )\I \F{ Xi ) - F{ Xj )\ < - 



Proof. By the Dvoretzky, Kiefer and Wolfowitz (1956) theorem, we have that Ve > 0, 

P fsup \F n (x) - F(x)\ > e\ < Ce~ 2ne \ 
Therefore, \F(x) — F(x)\ = O p uniformly on x. Hence, writing 

\F( Xi ) - F( Xj )\ = \F(xi) - F n { Xi ) + F n ( Xl ) - F{ Xj ) + F n { Xj ) - F n ( Xj )\, 
it follows that sup XuXj — F(xj)\I \F( Xi ) — F(xj)\ < c/n j is less than or equal to 

sup Xi)Xj jl-F(xi) - F n ( Xi ) \ + \F n (xj) - F(xj)\\ 



sup XuXj \ \F n (xi) - F n (xj)\ \ I \F n ( Xi ) - F n (xj)\ < - 



O r , 



1 



Or 



1 



<n J \V n 
This completes the proof of the lemma. 



Or, 



n 



□ 



Lemma A. 5. With Wi be defined in (Tj|), and any Lipschitz continuous function g(x), 



1 n / i \ 

- V g(x 2j )I(j e Wi) - g(x 2i ) = O p [-= ), 



uniformly in i = 1, . . . , n . 



Proof. First note that by the Lipschitz continuity and the Mean Value Theorem we have 

\g(x2j) ~ g(x2i)\ < M\x 2 j - X 2i \ < M\Fx 2 (x 2 j) - Fx 2 (x 2i )\/fx 2 (Xij), 



for some constant M, where between x 2j and x 2i . Thus, 

^ n 1 n 

- ^2g(x 2j )I(j e Wi) - g{x 2i ) < - \g(%2j) - g(x 2i )\I 



< 



|^x 2 (s 2 i) - F X2 (x 2j )\ < 



p — 1 
2n 



M" \F X2 (x 2j )-Fx 2 (x 2i )\ l 
fx 2 (xij) 



V 



i=i 



\F X2 (x 2i ) -F X2 (x 2j )\ < 



p — 1 

2n 
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where the last equality follows from Lemma IA.4I and the assumption that fx 2 remains 
bounded away from zero. □ 
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